This code covers chapter 7 of “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach and Vipin Kumar. See the table of contents for code examples for other chapters.

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Prepare Data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Here we will use a small and very clean dataset called Ruspini, which is included in the R package cluster.

data(ruspini, package = "cluster")

The Ruspini data set, consisting of 75 points in four groups, is popular for illustrating clustering techniques. It is a very simple data set with well-separated clusters. The original dataset has the points ordered by group, so we shuffle the rows using sample_frac, which by default samples 100% of the rows.

ruspini <- as_tibble(ruspini) %>% sample_frac()
ruspini
## # A tibble: 75 x 2
##        x     y
##    <int> <int>
##  1    47   149
##  2   110   111
##  3    28   147
##  4   108   111
##  5    38   143
##  6    61    15
##  7    13    69
##  8    78    16
##  9   115   117
## 10    58    13
## # … with 65 more rows

Data cleaning

ggplot(ruspini, aes(x = x, y = y)) + geom_point()

summary(ruspini)
##        x                y         
##  Min.   :  4.00   Min.   :  4.00  
##  1st Qu.: 31.50   1st Qu.: 56.50  
##  Median : 52.00   Median : 96.00  
##  Mean   : 54.88   Mean   : 92.03  
##  3rd Qu.: 76.50   3rd Qu.:141.50  
##  Max.   :117.00   Max.   :156.00

For most clustering algorithms it is necessary to handle missing values and outliers first (e.g., by removing the affected observations). For details see the section “Outlier removal” below. This data set has no missing values or strong outliers and looks like it has some very clear groups.

Scale data

Clustering algorithms use distances, and the variables with the largest value range will dominate the distance calculation. The summary above shows that this is not an issue for the Ruspini dataset, with both x and y roughly ranging between 0 and 150. Most data analysts will still scale each column in the data to zero mean and unit standard deviation (z-scores). Note: The standard scale() function scales a whole data matrix, so we implement a helper function that applies it to each numeric column of a data frame.

# I use this until tidyverse implements a scale function
scale_numeric <- function(x) x %>% mutate_if(is.numeric, function(y) as.vector(scale(y)))

ruspini_scaled <- ruspini %>% scale_numeric()
summary(ruspini_scaled)
##        x                  y           
##  Min.   :-1.66806   Min.   :-1.80743  
##  1st Qu.:-0.76649   1st Qu.:-0.72946  
##  Median :-0.09442   Median : 0.08158  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.70879   3rd Qu.: 1.01582  
##  Max.   : 2.03655   Max.   : 1.31355

After scaling, most z-scores will fall in the range \([-3,3]\) (z-scores are measured in standard deviations from the mean), where \(0\) means average.
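As a quick check of this property, we can standardize a small made-up vector with base R and verify that the result has mean 0 and standard deviation 1 (the values below are hypothetical, not taken from the Ruspini data):

```r
x <- c(4, 31.5, 52, 76.5, 117)     # hypothetical raw values
z <- (x - mean(x)) / sd(x)         # z-scores: standard deviations from the mean
round(mean(z), 10)                 # 0
round(sd(z), 10)                   # 1
```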

Clustering methods

k-means Clustering

k-means implicitly assumes Euclidean distances. We use \(k = 4\) clusters and run the algorithm 10 times with randomly initialized centroids. The best result is returned.

km <- kmeans(ruspini_scaled, centers = 4, nstart = 10)
km
## K-means clustering with 4 clusters of sizes 20, 15, 17, 23
## 
## Cluster means:
##            x          y
## 1 -1.1385941 -0.5559591
## 2  0.4607268 -1.4912271
## 3  1.4194387  0.4692907
## 4 -0.3595425  1.1091151
## 
## Clustering vector:
##  [1] 4 3 4 3 4 2 1 2 3 2 4 4 2 1 2 3 4 4 4 1 4 3 4 4 4 3 3 4 1 1 2 1 3 4 4 4 3 4
## [39] 2 1 4 3 1 3 4 3 3 1 2 4 1 1 1 1 1 1 2 1 2 2 1 2 3 1 3 1 4 3 1 3 4 4 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 2.705477 1.082373 3.641276 2.658679
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

km is an R object implemented as a list. The clustering vector contains the cluster assignment for each data row and can be accessed using km$cluster. I add the cluster assignment as a column to the scaled dataset (I make it a factor since it represents a nominal label).

ruspini_clustered <- ruspini_scaled %>% add_column(cluster = factor(km$cluster))
ruspini_clustered
## # A tibble: 75 x 3
##         x      y cluster
##     <dbl>  <dbl> <fct>  
##  1 -0.258  1.17  4      
##  2  1.81   0.390 3      
##  3 -0.881  1.13  4      
##  4  1.74   0.390 3      
##  5 -0.553  1.05  4      
##  6  0.201 -1.58  2      
##  7 -1.37  -0.473 1      
##  8  0.758 -1.56  2      
##  9  1.97   0.513 3      
## 10  0.102 -1.62  2      
## # … with 65 more rows
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point()

Add the centroids to the plot.

centroids <- as_tibble(km$centers, rownames = "cluster")
centroids
## # A tibble: 4 x 3
##   cluster      x      y
##   <chr>    <dbl>  <dbl>
## 1 1       -1.14  -0.556
## 2 2        0.461 -1.49 
## 3 3        1.42   0.469
## 4 4       -0.360  1.11
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point() +
  geom_point(data = centroids, aes(x = x, y = y, color = cluster), shape = 3, size = 10)

Use the factoextra package for visualization

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(km, data = ruspini_scaled, centroids = TRUE, repel = TRUE, ellipse.type = "norm")
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Inspect clusters

We inspect the clusters created by the 4-cluster k-means solution. The following code can be adapted to be used for other clustering methods.

Cluster Profiles

Inspect the centroids with horizontal bar charts organized by cluster. To group the plots by cluster, we first have to convert the data to long format using a pivot operation.

ggplot(pivot_longer(centroids, cols = c(x, y), names_to = "feature"),
  aes(x = value, y = feature)) +
  geom_bar(stat = "identity") +
  facet_grid(rows = vars(cluster))

Extract a single cluster

All you need to do is filter the rows corresponding to the cluster index. The next example calculates summary statistics and then plots all data points of cluster 1.

cluster1 <- ruspini_clustered %>% filter(cluster == 1)
cluster1
## # A tibble: 20 x 3
##         x       y cluster
##     <dbl>   <dbl> <fct>  
##  1 -1.37  -0.473  1      
##  2 -0.881 -0.658  1      
##  3 -1.01  -0.699  1      
##  4 -1.31  -0.350  1      
##  5 -0.914 -0.411  1      
##  6 -1.18  -0.555  1      
##  7 -1.21  -0.637  1      
##  8 -0.816 -0.822  1      
##  9 -0.750 -0.637  1      
## 10 -0.914 -0.760  1      
## 11 -0.619 -0.411  1      
## 12 -1.41  -0.0827 1      
## 13 -1.47  -0.678  1      
## 14 -1.37  -0.883  1      
## 15 -1.08  -0.370  1      
## 16 -0.783 -0.658  1      
## 17 -1.50  -0.309  1      
## 18 -1.64  -0.596  1      
## 19 -0.881 -0.329  1      
## 20 -1.67  -0.801  1
summary(cluster1)
##        x                 y            cluster
##  Min.   :-1.6681   Min.   :-0.88346   1:20   
##  1st Qu.:-1.3812   1st Qu.:-0.68326   2: 0   
##  Median :-1.1271   Median :-0.61653   3: 0   
##  Mean   :-1.1386   Mean   :-0.55596   4: 0   
##  3rd Qu.:-0.8812   3rd Qu.:-0.40094          
##  Max.   :-0.6190   Max.   :-0.08268
ggplot(cluster1, aes(x = x, y = y)) + geom_point() +
  coord_cartesian(xlim = c(-2, 2), ylim = c(-2, 2))

What happens if we try to cluster with 8 centers?

fviz_cluster(kmeans(ruspini_scaled, centers = 8), data = ruspini_scaled,
  centroids = TRUE,  geom = "point", ellipse.type = "norm")

Hierarchical Clustering

dist defaults to method = "euclidean".

d <- dist(ruspini_scaled)

We cluster using complete link

hc <- hclust(d, method = "complete")

Hierarchical clustering does not return cluster assignments but a dendrogram. The standard plot function plots the dendrogram.

plot(hc)

Use factoextra (ggplot version). We can specify the number of clusters to visualize how the dendrogram will be cut into clusters.

fviz_dend(hc, k = 4)

More plotting options for dendrograms, including plotting parts of large dendrograms can be found here.

Extract cluster assignments by cutting the dendrogram into four parts and add the cluster id to the data.

clusters <- cutree(hc, k = 4)
cluster_complete <- ruspini_scaled %>%
  add_column(cluster = factor(clusters))
cluster_complete
## # A tibble: 75 x 3
##         x      y cluster
##     <dbl>  <dbl> <fct>  
##  1 -0.258  1.17  1      
##  2  1.81   0.390 2      
##  3 -0.881  1.13  1      
##  4  1.74   0.390 2      
##  5 -0.553  1.05  1      
##  6  0.201 -1.58  3      
##  7 -1.37  -0.473 4      
##  8  0.758 -1.56  3      
##  9  1.97   0.513 2      
## 10  0.102 -1.62  3      
## # … with 65 more rows
ggplot(cluster_complete, aes(x, y, color = cluster)) +
  geom_point()

Try 8 clusters (Note: fviz_cluster needs a list with data and the cluster labels for hclust)

fviz_cluster(list(data = ruspini_scaled, cluster = cutree(hc, k = 8)), geom = "point")

Clustering with single link

hc_single <- hclust(d, method = "single")
fviz_dend(hc_single, k = 4)

fviz_cluster(list(data = ruspini_scaled, cluster = cutree(hc_single, k = 4)), geom = "point")

Density-based clustering with DBSCAN

library(dbscan)

Parameters: minPts defines how many points in the epsilon neighborhood are needed to make a point a core point. It also acts as a smoothing parameter. I use minPts = 4 here.

To decide on epsilon, the knee in the kNN distance plot is often used. Note that minPts contains the point itself, while the k-nearest neighbor does not. We therefore have to use k = minPts - 1! The knee is around eps = .32.

kNNdistplot(ruspini_scaled, k = 3)
abline(h = .32, col = "red")
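To make the idea behind the kNN distance plot concrete, here is a minimal base-R sketch on hypothetical 1-d data (kNNdistplot computes and sorts these values for the whole dataset): points in dense groups have small k-NN distances, while outliers stand out with large ones.

```r
x <- c(1, 2, 3, 4, 100)                         # hypothetical data with one outlier
dmat <- as.matrix(dist(x))
# 3-NN distance: 4th-smallest value per row (the smallest is the distance to itself)
knn3 <- apply(dmat, 1, function(r) sort(r)[4])
knn3                                            # the outlier has by far the largest value
```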

Run dbscan.

db <- dbscan(ruspini_scaled, eps = .32, minPts = 4)
db
## DBSCAN clustering for 75 objects.
## Parameters: eps = 0.32, minPts = 4
## The clustering contains 4 cluster(s) and 5 noise points.
## 
##  0  1  2  3  4 
##  5 23 12 15 20 
## 
## Available fields: cluster, eps, minPts
str(db)
## List of 3
##  $ cluster: int [1:75] 1 2 1 2 1 3 4 3 2 3 ...
##  $ eps    : num 0.32
##  $ minPts : num 4
##  - attr(*, "class")= chr [1:2] "dbscan_fast" "dbscan"
ggplot(ruspini_scaled %>% add_column(cluster = factor(db$cluster)),
  aes(x, y, color = cluster)) + geom_point()

Note: Cluster 0 represents outliers (noise points).

fviz_cluster(db, ruspini_scaled, geom = "point")

Play with eps (neighborhood size) and minPts (minimum number of points in the epsilon neighborhood needed for a core point).

Partitioning Around Medoids (PAM)

Also called \(k\)-medoids. Similar to \(k\)-means, but it uses medoids instead of centroids to represent clusters and works on a precomputed distance matrix. An advantage is that you can use any distance metric, not just Euclidean distances. Note: The medoid is the most central data point in the middle of the cluster.
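As a minimal illustration of what a medoid is (using made-up 1-d points, not the Ruspini data): it is the actual data point that minimizes the total distance to all other points in the cluster, unlike a centroid, which is an average and usually not a data point.

```r
pts <- c(1, 2, 3, 10)                      # hypothetical cluster
dmat <- as.matrix(dist(pts))
medoid <- pts[which.min(rowSums(dmat))]    # point with the smallest total distance
medoid                                     # 2 (note: the centroid/mean would be 4)
```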

library(cluster)
## 
## Attaching package: 'cluster'
## The following object is masked _by_ '.GlobalEnv':
## 
##     ruspini
d <- dist(ruspini_scaled)
str(d)
##  'dist' num [1:2775] 2.208 0.624 2.147 0.32 2.789 ...
##  - attr(*, "Size")= int 75
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = ruspini_scaled)
p <- pam(d, k = 4)
p
## Medoids:
##      ID   
## [1,] 67 67
## [2,] 68 68
## [3,] 73 73
## [4,] 32 32
## Clustering vector:
##  [1] 1 2 1 2 1 3 4 3 2 3 1 1 3 4 3 2 1 1 1 4 1 2 1 1 1 2 2 1 4 4 3 4 2 1 1 1 2 1
## [39] 3 4 1 2 4 2 1 2 2 4 3 1 4 4 4 4 4 4 3 4 3 3 4 3 2 4 2 4 1 2 4 2 1 1 3 3 3
## Objective function:
##     build      swap 
## 0.4422977 0.3187056 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
## [6] "clusinfo"   "silinfo"    "diss"       "call"
ruspini_clustered <- ruspini_scaled %>% add_column(cluster = factor(p$cluster))

medoids <- as_tibble(ruspini_scaled[p$medoids, ], rownames = "cluster")
medoids
## # A tibble: 4 x 3
##   cluster      x      y
##   <chr>    <dbl>  <dbl>
## 1 1       -0.357  1.17 
## 2 2        1.45   0.554
## 3 3        0.463 -1.46 
## 4 4       -1.18  -0.555
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point() +
  geom_point(data = medoids, aes(x = x, y = y, color = cluster), shape = 3, size = 10)

# Note: fviz_cluster needs the original data.
fviz_cluster(c(p, list(data = ruspini_scaled)), geom = "point", ellipse.type = "norm")

Gaussian Mixture Models

library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map

Mclust uses Bayesian Information Criterion (BIC) to find the number of clusters (model selection). BIC uses the likelihood and a penalty term to guard against overfitting.

m <- Mclust(ruspini_scaled)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 5 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -91.26485 75 16 -251.6095 -251.7486
## 
## Clustering table:
##  1  2  3  4  5 
## 23 14 15 20  3
plot(m, what = "classification")
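The BIC value reported in the summary above can be reproduced from the log-likelihood, the number of observations and the degrees of freedom (mclust uses the form \(2 \log L - df \log n\)):

```r
# Values taken from the summary output above
loglik <- -91.26485; n <- 75; df <- 16
2 * loglik - df * log(n)          # -251.6095, matching the reported BIC
```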

Rerun with a fixed number of 4 clusters

m <- Mclust(ruspini_scaled, G=4)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 4 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -101.6027 75 13 -259.3327 -259.3356
## 
## Clustering table:
##  1  2  3  4 
## 23 17 15 20
plot(m, what = "classification")

Spectral clustering

Spectral clustering works by embedding the data points into the subspace spanned by the k largest eigenvectors of a normalized affinity/kernel matrix, and then applying a simple clustering method like k-means in this embedded space.
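The following is a minimal base-R sketch of this idea on a tiny hypothetical dataset (assumptions: Gaussian affinity with sigma = 1 and symmetric normalization; specc handles all of this, including estimating sigma, automatically):

```r
x <- c(0, 0.1, 5, 5.1)                      # two obvious groups
A <- exp(-as.matrix(dist(x))^2)             # Gaussian affinity matrix
diag(A) <- 0
Dm <- diag(1 / sqrt(rowSums(A)))
L <- Dm %*% A %*% Dm                        # symmetrically normalized affinity
emb <- eigen(L)$vectors[, 1:2]              # embed into the 2 largest eigenvectors
kmeans(emb, centers = 2, nstart = 10)$cluster  # k-means in the embedded space
```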

library("kernlab")
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
cluster_spec <- specc(as.matrix(ruspini_scaled), centers = 4)
cluster_spec
## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 4 1 4 1 4 3 1 2 1 3 4 4 3 1 3 1 4 4 4 1 4 1 4 4 4 1 1 4 1 1 3 1 1 4 4 4 1 4 2 1 4 1 1 1 4 1 1 1 3 4 1 1 1 1 1 1 3 1 3 2 1 2 1 1 1 1 4 1 1 1 4 4 3 3 3 
##  
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  45.2704818907988 
## 
## Centers:  
##             [,1]       [,2]
## [1,]  0.03671827 -0.0848984
## [2,]  0.77436200 -1.4994402
## [3,]  0.34667765 -1.4882405
## [4,] -0.35954252  1.1091151
## 
## Cluster size:  
## [1] 37  4 11 23
## 
## Within-cluster sum of squares:  
## [1] 76.54712 21.99692 35.05788 48.78351
ggplot(ruspini_scaled %>% add_column(cluster = factor(cluster_spec)),
  aes(x, y, color = cluster)) + geom_point()

Fuzzy C-Means Clustering

The fuzzy version of the well-known k-means clustering algorithm. Each data point has a degree of membership for each cluster.
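A minimal sketch of how such memberships arise (assuming the standard fuzzifier m = 2 and made-up distances, not cmeans internals): the membership of a point in a cluster is inversely proportional to the squared distance to that cluster's center, normalized so the memberships sum to 1.

```r
d_to_centers <- c(1, 3)                   # hypothetical distances to two centers
u <- (1 / d_to_centers^2) / sum(1 / d_to_centers^2)
u                                         # 0.9 0.1: mostly in the closer cluster
sum(u)                                    # memberships sum to 1
```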

library("e1071")

cluster_cmeans <- cmeans(as.matrix(ruspini_scaled), centers = 4)
cluster_cmeans
## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##            x          y
## 1 -0.3763177  1.1143342
## 2  1.5047515  0.5160657
## 3 -1.1371363 -0.5549794
## 4  0.4552395 -1.4760173
## 
## Memberships:
##                  1            2            3            4
##  [1,] 9.885265e-01 4.752034e-03 0.0044840296 0.0022374725
##  [2,] 1.929312e-02 9.507897e-01 0.0106801419 0.0192370144
##  [3,] 8.622280e-01 3.625375e-02 0.0758503888 0.0256679038
##  [4,] 1.387638e-02 9.650089e-01 0.0075747401 0.0135399775
##  [5,] 9.754250e-01 7.760630e-03 0.0120646138 0.0047497722
##  [6,] 9.525816e-03 1.186792e-02 0.0254625710 0.9531436941
##  [7,] 1.709744e-02 6.485645e-03 0.9626082407 0.0138086791
##  [8,] 1.111467e-02 1.926395e-02 0.0203873470 0.9492340275
##  [9,] 3.390579e-02 9.158218e-01 0.0184329081 0.0318394736
## [10,] 1.726642e-02 2.037898e-02 0.0498112639 0.9125433394
## [11,] 9.293537e-01 1.972789e-02 0.0376049965 0.0133133957
## [12,] 9.786609e-01 6.878044e-03 0.0103262494 0.0041347678
## [13,] 1.347435e-03 1.852953e-03 0.0031667519 0.9936328602
## [14,] 2.104333e-02 1.010341e-02 0.9397670162 0.0290862501
## [15,] 1.395807e-03 2.028558e-03 0.0030790064 0.9934966285
## [16,] 5.072006e-04 9.988318e-01 0.0002500010 0.0004109695
## [17,] 8.915431e-01 5.510101e-02 0.0333622885 0.0199936193
## [18,] 7.550870e-01 1.328684e-01 0.0672230717 0.0448215158
## [19,] 9.889218e-01 4.268858e-03 0.0046269991 0.0021823233
## [20,] 9.545764e-03 4.511101e-03 0.9731671401 0.0127759949
## [21,] 9.692811e-01 1.353906e-02 0.0112519576 0.0059279337
## [22,] 4.090859e-02 8.956440e-01 0.0229240231 0.0405233977
## [23,] 7.016208e-01 1.760204e-01 0.0713878136 0.0509709905
## [24,] 9.395445e-01 2.852425e-02 0.0204612522 0.0114700219
## [25,] 9.272268e-01 3.404931e-02 0.0247841189 0.0139397901
## [26,] 1.130486e-02 9.728625e-01 0.0058699434 0.0099627198
## [27,] 2.455647e-03 9.945446e-01 0.0011598053 0.0018399705
## [28,] 9.923512e-01 2.766916e-03 0.0033847074 0.0014972235
## [29,] 2.256100e-02 7.843906e-03 0.9540755555 0.0155195399
## [30,] 2.538261e-02 9.896854e-03 0.9426476717 0.0220728668
## [31,] 1.663978e-03 2.298177e-03 0.0038610539 0.9921767906
## [32,] 4.470678e-04 1.837830e-04 0.9989327653 0.0004363839
## [33,] 2.683364e-02 9.395405e-01 0.0130809438 0.0205449219
## [34,] 7.513701e-01 1.046573e-01 0.0920670122 0.0519055390
## [35,] 9.596660e-01 1.728009e-02 0.0151635939 0.0078903202
## [36,] 9.753055e-01 8.407888e-03 0.0114698868 0.0048167650
## [37,] 9.617736e-02 8.109442e-01 0.0392596788 0.0536187445
## [38,] 9.762848e-01 9.539695e-03 0.0095407564 0.0046347839
## [39,] 2.342939e-02 4.590733e-02 0.0384835264 0.8921797562
## [40,] 3.142756e-03 1.359415e-03 0.9920953478 0.0034024817
## [41,] 9.103298e-01 2.448069e-02 0.0483842527 0.0168053090
## [42,] 1.100404e-01 7.056376e-01 0.0654648037 0.1188571912
## [43,] 3.837317e-02 2.108360e-02 0.8665088821 0.0740343467
## [44,] 2.176786e-01 4.702613e-01 0.1283049130 0.1837552004
## [45,] 9.245094e-01 2.370814e-02 0.0371532799 0.0146291803
## [46,] 8.072621e-03 9.834303e-01 0.0034557335 0.0050413069
## [47,] 1.267116e-01 7.877448e-01 0.0394162663 0.0461273897
## [48,] 4.259783e-02 2.129940e-02 0.8727576442 0.0633451278
## [49,] 1.135754e-02 1.645554e-02 0.0249345748 0.9472523483
## [50,] 9.964130e-01 1.321786e-03 0.0015600713 0.0007051195
## [51,] 2.248346e-02 1.143259e-02 0.9302636218 0.0358203222
## [52,] 9.311887e-02 4.137707e-02 0.7683801229 0.0971239450
## [53,] 9.817150e-02 2.771397e-02 0.8288388020 0.0452757287
## [54,] 2.686692e-02 1.152719e-02 0.9343457584 0.0272601350
## [55,] 3.001445e-02 1.461002e-02 0.9148706525 0.0405048734
## [56,] 1.357173e-02 4.907475e-03 0.9712822899 0.0102385008
## [57,] 1.011326e-02 1.337414e-02 0.0241499976 0.9523626002
## [58,] 3.663290e-02 1.831452e-02 0.8900885300 0.0549640545
## [59,] 1.083477e-02 1.340792e-02 0.0287382711 0.9470190470
## [60,] 1.086597e-02 1.817958e-02 0.0205929667 0.9503614841
## [61,] 5.307526e-02 1.797576e-02 0.8953158507 0.0336331290
## [62,] 1.015955e-02 1.775434e-02 0.0183427126 0.9537433908
## [63,] 7.442472e-03 9.845896e-01 0.0032207359 0.0047472284
## [64,] 4.917225e-02 1.998631e-02 0.8877334204 0.0431080224
## [65,] 1.738848e-01 5.412219e-01 0.1075064006 0.1773868267
## [66,] 4.507022e-02 1.644878e-02 0.9045021678 0.0339788343
## [67,] 9.976967e-01 8.878786e-04 0.0009642636 0.0004511864
## [68,] 1.323589e-03 9.971242e-01 0.0006089516 0.0009432799
## [69,] 5.521421e-02 2.497429e-02 0.8604287050 0.0593827986
## [70,] 1.482968e-02 9.702929e-01 0.0061526285 0.0087247516
## [71,] 9.186982e-01 2.380073e-02 0.0419824586 0.0155186487
## [72,] 9.467202e-01 1.728616e-02 0.0256384964 0.0103551011
## [73,] 5.053169e-05 7.425004e-05 0.0001096031 0.9997656152
## [74,] 3.346466e-03 4.422486e-03 0.0082407116 0.9839903362
## [75,] 9.017287e-03 1.454357e-02 0.0173387945 0.9591003500
## 
## Closest hard clustering:
##  [1] 1 2 1 2 1 4 3 4 2 4 1 1 4 3 4 2 1 1 1 3 1 2 1 1 1 2 2 1 3 3 4 3 2 1 1 1 2 1
## [39] 4 3 1 2 3 2 1 2 2 3 4 1 3 3 3 3 3 3 4 3 4 4 3 4 2 3 2 3 1 2 3 2 1 1 4 4 4
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"

Plot membership (shown as small pie charts)

library("scatterpie")
ggplot()  +
  geom_scatterpie(data = cbind(ruspini_scaled, cluster_cmeans$membership),
    aes(x = x, y = y), cols = colnames(cluster_cmeans$membership), legend_name = "Membership") + coord_equal()

Internal Cluster Validation

Compare the Clustering Quality

Look at the within.cluster.ss and the avg.silwidth

#library(fpc)

Notes:

* I do not load fpc since it masks functions in the dbscan package.
* The clustering (second argument below) has to be supplied as a vector of integer cluster IDs and cannot be a factor (use as.integer() to convert a factor to IDs).

fpc::cluster.stats(d, km$cluster)
## $n
## [1] 75
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 20 15 17 23
## 
## $min.cluster.size
## [1] 15
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.1192822 0.8359025 1.4627043 1.1591436
## 
## $average.distance
## [1] 0.4824376 0.3564457 0.5805551 0.4286139
## 
## $median.distance
## [1] 0.4492432 0.3379729 0.5023855 0.3934100
## 
## $separation
## [1] 1.157682 1.157682 0.767612 0.767612
## 
## $average.toother
## [1] 2.157193 2.307969 2.293318 2.148527
## 
## $separation.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,] 0.000000 1.157682 1.339721 1.219930
## [2,] 1.157682 0.000000 1.308435 1.957726
## [3,] 1.339721 1.308435 0.000000 0.767612
## [4,] 1.219930 1.957726 0.767612 0.000000
## 
## $ave.between.matrix
##          [,1]     [,2]     [,3]     [,4]
## [1,] 0.000000 1.874198 2.771960 1.887363
## [2,] 1.874198 0.000000 2.220011 2.750174
## [3,] 2.771960 2.220011 0.000000 1.924915
## [4,] 1.887363 2.750174 1.924915 0.000000
## 
## $average.between
## [1] 2.219257
## 
## $average.within
## [1] 0.4629733
## 
## $n.between
## [1] 2091
## 
## $n.within
## [1] 684
## 
## $max.diameter
## [1] 1.462704
## 
## $min.separation
## [1] 0.767612
## 
## $within.cluster.ss
## [1] 10.08781
## 
## $clus.avg.silwidths
##         1         2         3         4 
## 0.7211353 0.8073733 0.6812849 0.7454551 
## 
## $avg.silwidth
## [1] 0.7368082
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.8415597
## 
## $dunn
## [1] 0.5247896
## 
## $dunn2
## [1] 3.228286
## 
## $entropy
## [1] 1.37327
## 
## $wb.ratio
## [1] 0.2086163
## 
## $ch
## [1] 323.5512
## 
## $cwidegap
## [1] 0.2611701 0.2351854 0.4149825 0.3152817
## 
## $widestgap
## [1] 0.4149825
## 
## $sindex
## [1] 0.8583457
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Read ? cluster.stats for an explanation of all the available indices.

sapply(
  list(
    km = km$cluster,
    hc_compl = cutree(hc, k = 4),
    hc_single = cutree(hc_single, k = 4)
  ),
  FUN = function(x)
    fpc::cluster.stats(d, x))[c("within.cluster.ss", "avg.silwidth"), ]
##                   km        hc_compl  hc_single
## within.cluster.ss 10.08781  10.08781  10.08781 
## avg.silwidth      0.7368082 0.7368082 0.7368082

Silhouette plot

library(cluster)
plot(silhouette(km$cluster, d))

Note: The silhouette plot does not show correctly in RStudio if you have too many objects (bars are missing). It will work when you open a new plotting device with windows(), x11() or quartz(). ggplot versions of the plot are also available.

ggplot visualization using factoextra

fviz_silhouette(silhouette(km$cluster, d))
##   cluster size ave.sil.width
## 1       1   20          0.72
## 2       2   15          0.81
## 3       3   17          0.68
## 4       4   23          0.75
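For reference, the average silhouette widths above are averages of per-point silhouettes \(s = (b - a)/\max(a, b)\), where \(a\) is the average distance to the point's own cluster and \(b\) is the average distance to the nearest other cluster. A quick check with made-up values:

```r
a <- 0.4   # hypothetical avg. distance within the point's own cluster
b <- 1.6   # hypothetical avg. distance to the nearest other cluster
(b - a) / max(a, b)               # 0.75: close to 1, so the point is well placed
```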

Find Optimal Number of Clusters for k-means

ggplot(ruspini_scaled, aes(x, y)) + geom_point()

# We will use different methods and try 2 to 10 clusters.
set.seed(1234)
ks <- 2:10

Within Sum of Squares

Use within sum of squares and look for the knee (nstart = 5 repeats k-means 5 times and returns the best solution)

WSS <- sapply(ks, FUN = function(k) {
  kmeans(ruspini_scaled, centers = k, nstart = 5)$tot.withinss
  })

ggplot(tibble(ks, WSS), aes(ks, WSS)) + geom_line() +
  geom_vline(xintercept = 4, color = "red", linetype = 2)

Average Silhouette Width

Use average silhouette width (look for the max)

ASW <- sapply(ks, FUN=function(k) {
  fpc::cluster.stats(d, kmeans(ruspini_scaled, centers=k, nstart = 5)$cluster)$avg.silwidth
  })

best_k <- ks[which.max(ASW)]
best_k
## [1] 4
ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)

Dunn Index

Use the Dunn index (another internal measure given by min. separation / max. diameter).

DI <- sapply(ks, FUN=function(k) {
  fpc::cluster.stats(d, kmeans(ruspini_scaled, centers=k, nstart=5)$cluster)$dunn
})

best_k <- ks[which.max(DI)]
ggplot(tibble(ks, DI), aes(ks, DI)) + geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)

Gap Statistic

The gap statistic compares the change in within-cluster dispersion with that expected from a null model (see ? clusGap). The default method chooses the smallest k such that Gap(k) is not more than one standard error below the first local maximum.

library(cluster)
k <- clusGap(ruspini_scaled, FUN = kmeans,  nstart = 10, K.max = 10)
k
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = ruspini_scaled, FUNcluster = kmeans, K.max = 10,     nstart = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##           logW   E.logW         gap     SE.sim
##  [1,] 3.497911 3.467106 -0.03080497 0.03570312
##  [2,] 3.073937 3.150184  0.07624761 0.03743636
##  [3,] 2.678004 2.902669  0.22466432 0.03796393
##  [4,] 2.106416 2.703517  0.59710065 0.03633061
##  [5,] 1.987213 2.569927  0.58271427 0.03474390
##  [6,] 1.863936 2.451066  0.58712975 0.03650821
##  [7,] 1.732234 2.347809  0.61557504 0.03954281
##  [8,] 1.640492 2.256196  0.61570423 0.04131903
##  [9,] 1.558316 2.172143  0.61382673 0.04090234
## [10,] 1.471552 2.092989  0.62143694 0.03933863
plot(k)
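We can check the 'firstSEmax' decision manually with the gap and SE.sim values printed above (rounded; the first local maximum of the gap is at k = 4):

```r
gap <- c(-0.0308, 0.0762, 0.2247, 0.5971, 0.5827, 0.5871,
         0.6156, 0.6157, 0.6138, 0.6214)
se  <- c(0.0357, 0.0374, 0.0380, 0.0363, 0.0347, 0.0365,
         0.0395, 0.0413, 0.0409, 0.0393)
maxk <- 4                                  # first local maximum of the gap
# smallest k whose gap is within one SE of the first local maximum
which(gap >= gap[maxk] - se[maxk])[1]      # 4, the chosen number of clusters
```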

Note: these methods can also be used for hierarchical clustering.

There have been many other methods and indices proposed to determine the number of clusters. See, e.g., package NbClust.

Visualizing the Distance Matrix

ggplot(ruspini_scaled, aes(x, y, color = factor(km$cluster))) + geom_point()

d <- dist(ruspini_scaled)

Inspect the distance matrix between the first 5 objects.

as.matrix(d)[1:5, 1:5]
##           1          2         3          4         5
## 1 0.0000000 2.20786556 0.6242513 2.14665311 0.3197442
## 2 2.2078656 0.00000000 2.7880733 0.06556833 2.4502009
## 3 0.6242513 2.78807327 0.0000000 2.72490676 0.3379729
## 4 2.1466531 0.06556833 2.7249068 0.00000000 2.3870988
## 5 0.3197442 2.45020088 0.3379729 2.38709881 0.0000000

A false-color image visualizes each value in the matrix as a pixel with the color representing the value.

library(seriation)
pimage(d)

Rows and columns are the objects as they are ordered in the data set. The diagonal represents the distance between an object and itself and has by definition a distance of 0 (dark line). Visualizing the unordered distance matrix does not show much structure, but we can reorder the matrix (rows and columns) using the k-means cluster labels from cluster 1 to 4. A clear block structure representing the clusters becomes visible.

pimage(d, order=order(km$cluster))

The function dissplot in package seriation rearranges the matrix and adds lines and cluster labels. The lower half of the plot shows average dissimilarities between clusters. The function organizes the objects by cluster and then reorders clusters and objects within clusters so that more similar objects are closer together.

dissplot(d, labels = km$cluster, options=list(main="k-means with k=4"))

The reordering by dissplot makes the misspecification of k visible as blocks.

dissplot(d, labels = kmeans(ruspini_scaled, centers = 3)$cluster)

dissplot(d, labels = kmeans(ruspini_scaled, centers = 9)$cluster)

Using factoextra

fviz_dist(d)

External Cluster Validation

External cluster validation uses ground truth information. That is, the user has an idea how the data should be grouped. This could be a known class label not provided to the clustering algorithm.

We use an artificial data set with known groups.

library(mlbench)
set.seed(1234)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)
plot(shapes)

Prepare data

truth <- as.integer(shapes$class)
shapes <- scale(shapes$x)
colnames(shapes) <- c("x", "y")
shapes <- as_tibble(shapes)

ggplot(shapes, aes(x, y)) + geom_point()

Find optimal number of Clusters for k-means

ks <- 2:20

Use within sum of squares (look for the knee)

WSS <- sapply(ks, FUN = function(k) {
  kmeans(shapes, centers = k, nstart = 10)$tot.withinss
})

ggplot(tibble(ks, WSS), aes(ks, WSS)) + geom_line()

Looks like it could be 7 clusters.

km <- kmeans(shapes, centers = 7, nstart = 10)

ggplot(shapes %>% add_column(cluster = factor(km$cluster)), aes(x, y, color = cluster)) +
  geom_point()

Hierarchical clustering: We use single-link because the mouth is non-convex and chaining may help.

d <- dist(shapes)
hc <- hclust(d, method = "single")

Find optimal number of clusters

ASW <- sapply(ks, FUN = function(k) {
  fpc::cluster.stats(d, cutree(hc, k))$avg.silwidth
})

ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line()

The maximum is clearly at 4 clusters.

hc_4 <- cutree(hc, 4)

ggplot(shapes %>% add_column(cluster = factor(hc_4)), aes(x, y, color = cluster)) +
  geom_point()

Compare with the ground truth using the corrected (= adjusted) Rand index (ARI), the variation of information (VI) index, entropy, and purity.

cluster.stats computes ARI and VI as comparative measures. I define functions for entropy and purity here:

entropy <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)

  cnts <- sapply(split(truth, cluster), table)
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0
  e <- -p * log(p, 2)

  sum(w * rowSums(e, na.rm = TRUE))
}

purity <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)

  cnts <- sapply(split(truth, cluster), table)
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0

  sum(w * apply(p, 1, max))
}
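Purity can also be computed directly from the confusion table of cluster and class labels. A quick sanity check on hypothetical labels: each cluster contributes its largest class count, divided by the total number of points.

```r
cl    <- c(1, 1, 2, 2, 2)                 # hypothetical cluster assignments
truth <- c(1, 1, 2, 2, 1)                 # hypothetical ground truth
tab <- table(cl, truth)
sum(apply(tab, 1, max)) / length(cl)      # 0.8: one of five points is impure
```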

Calculate the measures (for comparison we also use random “clusterings” with 4 and 6 clusters).

random_4 <- sample(1:4, nrow(shapes), replace = TRUE)
random_6 <- sample(1:6, nrow(shapes), replace = TRUE)

r <- rbind(
  kmeans_7 = c(
    unlist(fpc::cluster.stats(d, km$cluster, truth, compareonly = TRUE)),
    entropy = entropy(km$cluster, truth),
    purity = purity(km$cluster, truth)
    ),
  hc_4 = c(
    unlist(fpc::cluster.stats(d, hc_4, truth, compareonly = TRUE)),
    entropy = entropy(hc_4, truth),
    purity = purity(hc_4, truth)
    ),
  random_4 = c(
    unlist(fpc::cluster.stats(d, random_4, truth, compareonly = TRUE)),
    entropy = entropy(random_4, truth),
    purity = purity(random_4, truth)
    ),
  random_6 = c(
    unlist(fpc::cluster.stats(d, random_6, truth, compareonly = TRUE)),
    entropy = entropy(random_6, truth),
    purity = purity(random_6, truth)
    )
  )
r
##          corrected.rand        vi   entropy    purity
## kmeans_7    0.638228776 0.5708767 0.2285591 0.4638565
## hc_4        1.000000000 0.0000000 0.0000000 1.0000000
## random_4   -0.003235188 2.6831737 1.9883398 0.2878286
## random_6   -0.002125389 3.0762751 1.7280983 0.1436265

Notes:

Read ? cluster.stats for an explanation of all the available indices.